For this lab, we will work with gene expression data measured on breast and ovary tumors. The data originally comes from http://gemler.fzv.uni-mb.si/index.php but has been downsized so that it is easier to work with in our labs.
The data is similar to the Endometrium vs. Uterus cancer data set we have been working with for several weeks.
The data we will work with contains the expression of 3,000 genes, measured for 344 breast tumors and 198 ovary tumors.
In [ ]:
import numpy as np # numeric python
# scikit-learn (machine learning)
from sklearn import preprocessing
from sklearn import decomposition
In [ ]:
# Graphics
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
Question: What are the dimensions of X? How many samples come from ovary tumors? How many come from breast tumors?
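A possible way to answer this, assuming the expression matrix X and a label vector y (coded -1 for breast and +1 for ovary, as in the plotting loop further down) have been loaded; a random stand-in matrix is used here for illustration only:

```python
import numpy as np

# Stand-in data for illustration only: in the lab, X and y come from the
# downloaded gene expression files (344 breast + 198 ovary samples, 3,000 genes).
rng = np.random.RandomState(0)
X = rng.normal(size=(542, 3000))
y = np.array([-1] * 344 + [1] * 198)  # -1 = breast, +1 = ovary

print("Dimensions of X:", X.shape)         # (542, 3000)
print("Breast samples:", np.sum(y == -1))  # 344
print("Ovary samples:", np.sum(y == 1))    # 198
```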
PCA documentation: http://scikit-learn.org/0.17/modules/decomposition.html#pca and http://scikit-learn.org/0.17/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
In [ ]:
In [ ]:
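The next cell fits a PCA on X_norm, which is assumed to be a standardized version of X. One way to obtain it with the preprocessing module imported above (a sketch on a stand-in matrix):

```python
import numpy as np
from sklearn import preprocessing

# Stand-in for the raw expression matrix (in the lab, use the loaded X).
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(542, 3000))

# Center each gene to mean 0 and scale it to unit variance.
scaler = preprocessing.StandardScaler()
X_norm = scaler.fit_transform(X)
```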
pca = decomposition.PCA(n_components=30)
pca.fit(X_norm)
Question: Plot the fraction of variance explained by each component. Use pca.explained_variance_ratio_
In [ ]:
# TODO
plt.xlim([0, 29])
plt.xlabel("Number of PCs", fontsize=16)
plt.ylabel("Fraction of variance explained", fontsize=16)
Question: Use pca.transform to project the data onto its principal components. How is pca.explained_variance_ratio_ computed? Check this is the case by computing it yourself.
In [ ]:
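pca.explained_variance_ratio_ is the variance of the projected data along each component, divided by the total variance of the (centered) data. A sketch of this check on stand-in data:

```python
import numpy as np
from sklearn import decomposition

# Stand-in for the standardized expression data.
rng = np.random.RandomState(0)
X_norm = rng.normal(size=(200, 50))

pca = decomposition.PCA(n_components=30)
X_proj = pca.fit_transform(X_norm)  # project onto the principal components

# Variance along each PC divided by the total variance; the ddof convention
# cancels out in the ratio, so this matches sklearn's attribute.
ratio = X_proj.var(axis=0) / X_norm.var(axis=0).sum()
print(np.allclose(ratio, pca.explained_variance_ratio_))  # True
```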
Question: Plot the data in the space of the two first components; color breast samples in blue and ovary samples in orange. What do you observe? Can you separate the two classes visually?
In [ ]:
for color_name, tissue, tissue_name in zip(['blue', 'orange'], [-1, 1], ['breast', 'ovary']):
    plt.scatter(# TODO,
                c=color_name, label=tissue_name)
plt.legend(loc=(1.1, 0), fontsize=14)
plt.xlabel("PC 1", fontsize=16)
plt.ylabel("PC 2", fontsize=16)
Bonus question: Rather than visually, actually try to separate the two classes by a logistic regression line (using only the two first PCs). Plot the decision boundary. You can draw inspiration from http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py for the plot.
In [ ]:
In [ ]:
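A sketch of the bonus question on synthetic 2D data (in the lab, X2 would be the first two columns of pca.transform(X_norm)); it follows the grid-evaluation idea of the linked scikit-learn example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Stand-in 2D data: two roughly separated classes.
rng = np.random.RandomState(0)
X2 = np.vstack([rng.normal(-1, 1, size=(344, 2)),
                rng.normal(1, 1, size=(198, 2))])
y = np.array([-1] * 344 + [1] * 198)

clf = linear_model.LogisticRegression()
clf.fit(X2, y)

# Evaluate the classifier on a grid covering the data, then shade the regions;
# the boundary between the two shaded areas is the decision line.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y, edgecolors='k')
plt.xlabel("PC 1", fontsize=16)
plt.ylabel("PC 2", fontsize=16)
```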
Question: Repeat the PCA procedure on the data without outliers. Can you now visually separate the two tissues?
In [ ]:
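One simple heuristic for dropping outliers before refitting (an assumption for illustration, not necessarily the intended approach; in the lab you would typically spot the outliers on the PC1/PC2 scatter plot):

```python
import numpy as np
from sklearn import decomposition

# Stand-in data with three planted, obvious outliers.
rng = np.random.RandomState(0)
X_norm = rng.normal(size=(200, 50))
X_norm[:3] += 25.0

pca = decomposition.PCA(n_components=2)
X_proj = pca.fit_transform(X_norm)

# Flag samples lying very far from the origin in the PC1/PC2 plane.
dist = np.sqrt((X_proj ** 2).sum(axis=1))
keep = dist < dist.mean() + 3 * dist.std()
print("Outliers removed:", np.sum(~keep))

# Refit the PCA on the remaining samples only.
pca2 = decomposition.PCA(n_components=2)
X_proj2 = pca2.fit_transform(X_norm[keep])
```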
Question: How many PCs do you think are sufficient to represent your data? What do you expect will happen if you use the projection of the gene expressions on these PCs and run a cross-validation of a classification algorithm? Try it out. Is there a risk of overfitting when you do this?
In [ ]:
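A sketch of the cross-validation, using the modern scikit-learn API (sklearn.model_selection, which replaced sklearn.cross_validation after the 0.17 release the links above point to). Note that fitting the PCA inside the cross-validation loop, via a Pipeline, avoids leaking information from the test folds into the projection; fitting the PCA once on all the data before cross-validating is exactly where the overfitting risk lies:

```python
import numpy as np
from sklearn import decomposition, linear_model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Stand-in data and labels.
rng = np.random.RandomState(0)
X_norm = rng.normal(size=(200, 50))
y = rng.choice([-1, 1], size=200)

# PCA is refit on each training fold, then the classifier runs on the PCs.
model = Pipeline([("pca", decomposition.PCA(n_components=10)),
                  ("clf", linear_model.LogisticRegression())])
scores = cross_val_score(model, X_norm, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```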
Question: Working on the original features, how do you expect your decision boundary (and AUC) to change, for different algorithms, depending on whether or not the outliers are included in the data? Try it out.
In [ ]:
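A sketch of one way to try this out: cross-validated AUC on the original features, with and without planted outliers (synthetic data for illustration; the outlier-planting step is an assumption standing in for the real outlying samples):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

# Stand-in data: two weakly separated classes in 50 dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-0.2, 1, size=(100, 50)),
               rng.normal(0.2, 1, size=(100, 50))])
y = np.array([-1] * 100 + [1] * 100)

# Copy with a few extreme samples planted in the negative class.
X_out = X.copy()
X_out[:3] += 25.0

clf = linear_model.LogisticRegression()
auc_with = cross_val_score(clf, X_out, y, cv=5, scoring="roc_auc").mean()
auc_without = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print("AUC with outliers:    %.3f" % auc_with)
print("AUC without outliers: %.3f" % auc_without)
```

How much the outliers move the decision boundary depends on the algorithm: heavily regularized or margin-based methods tend to be more robust than an unregularized linear fit.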